Hyperparameter Tuning
It is important to use a validation dataset that is separate from the train and test sets when tuning model hyperparameters. (ref)
For any hyper-parameter that has an impact on the effective capacity of a learner, it makes more sense to select its value based on out-of-sample data (outside the training set), e.g., a validation set performance, online error, or cross-validation error. (ref)
It is also important not to include the validation dataset when evaluating the performance of the model. (ref)
Once some out-of-sample data has been used for selecting hyper-parameter values, it cannot be used anymore to obtain an unbiased estimator of generalization performance, so one typically uses a test set (or double cross-validation, in the case of small datasets) to estimate generalization error of the pure learning algorithm (with hyper-parameter selection hidden inside). (ref)
Note that cross-validation is often not used with neural network models given that they can take days, weeks, or even months to train. (ref)
Nevertheless, on smaller datasets where cross-validation can be used, the double cross-validation technique is suggested, where hyperparameter tuning is performed within each cross-validation fold. (ref)
Double cross-validation applies recursively the idea of cross-validation, using an outer loop cross-validation to evaluate generalization error and then applying an inner loop cross-validation inside each outer loop split’s training subset (i.e., splitting it again into training and validation folds) in order to select hyper-parameters for that split. (ref)
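As a rough illustration of this nested procedure, the sketch below uses scikit-learn (a tool choice assumed here, not taken from the text): an inner cross-validation selects hyper-parameters within each outer training subset, and the outer cross-validation estimates the generalization error of the whole procedure. The estimator, parameter grid, and synthetic dataset are illustrative assumptions only.

```python
# A minimal sketch of double (nested) cross-validation with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=1)

# Inner loop: selects hyper-parameters within each outer training subset.
inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)
param_grid = {"alpha": [1e-4, 1e-3, 1e-2], "learning_rate_init": [0.001, 0.01]}
search = GridSearchCV(
    estimator=MLPClassifier(max_iter=500, random_state=1),
    param_grid=param_grid,
    cv=inner_cv,
)

# Outer loop: estimates generalization error of the whole procedure,
# with hyper-parameter selection hidden inside each fold.
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(search, X, y, cv=outer_cv)
print("Estimated generalization accuracy: %.3f" % scores.mean())
```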
Hyperparameters:
The hyperparameters in deep learning can be divided into two groups: the learning hyperparameters and the model hyperparameters. (ref)
Learning Hyperparameters:
The learning hyperparameters in this suite are:
- Initial Learning Rate. The proportion that weights are updated; 0.01 is a good start.
- Learning Rate Schedule. Decrease in learning rate over time; 1/t is a good start.
- Mini-batch Size. Number of samples used to estimate the gradient; 32 is a good start.
- Training Iterations. Number of updates to the weights; set large and use early stopping.
- Momentum. Use history from prior weight updates; set large (e.g. 0.9).
- Layer-Specific Hyperparameters. Possible, but rarely done.
Note that the learning rate is presented as the most important hyperparameter to tune. Although a value of 0.01 is a recommended starting point, dialing it in for a specific dataset and model is required. (ref)
We should note that the batch size is presented as a tool to control the speed of learning, not to tune test set performance (generalization error). In theory, this hyper-parameter should impact training time and not so much test performance, so it can be optimized separately of the other hyperparameters, by comparing training curves (training and validation error vs amount of training time), after the other hyper-parameters (except learning rate) have been selected. (ref)
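To make the suite concrete, here is a minimal sketch in Keras (an assumed framework, not prescribed by the text) that sets an initial learning rate of 0.01, a 1/t-style decay schedule, a mini-batch size of 32, momentum of 0.9, and a large number of training iterations halted by early stopping. The dataset, layer sizes, and schedule constants are illustrative assumptions.

```python
# A minimal sketch of the learning hyperparameters using Keras.
import numpy as np
import tensorflow as tf

# Synthetic data, used only to make the example runnable.
X = np.random.rand(1000, 20).astype("float32")
y = (X.sum(axis=1) > 10).astype("float32")

# Initial learning rate of 0.01 with an approximately 1/t decay schedule.
schedule = tf.keras.optimizers.schedules.InverseTimeDecay(
    initial_learning_rate=0.01, decay_steps=1000, decay_rate=1.0)
optimizer = tf.keras.optimizers.SGD(learning_rate=schedule, momentum=0.9)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer=optimizer, loss="binary_crossentropy",
              metrics=["accuracy"])

# Set training iterations large and rely on early stopping to halt training.
stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                        restore_best_weights=True)
model.fit(X, y, validation_split=0.2, epochs=1000, batch_size=32,
          callbacks=[stop], verbose=0)
```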
Model Hyperparameters:
The model hyperparameters are:
- Number of Nodes. Control over the capacity of the model; use larger models with regularization.
- Weight Regularization. Penalize models with large weights; try L2 generally or L1 for sparsity.
- Activity Regularization. Penalize model for large activations; try L1 for sparse representations.
- Activation Function. The transform applied to the output of nodes in hidden layers; use sigmoidal functions (logistic and tanh) or the rectifier (now the standard).
- Weight Initialization. The starting point for the optimization process; influenced by activation function and size of the prior layer.
- Random Seeds. Stochastic nature of optimization process; average models from multiple runs.
- Preprocessing. Prepare data prior to modeling; at least standardize and remove correlations.
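The sketch below shows how these model hyperparameters might be expressed in Keras (an assumed framework); the layer width, penalty strengths, and synthetic data are illustrative assumptions rather than recommendations.

```python
# A minimal sketch of the model hyperparameters using Keras.
import numpy as np
import tensorflow as tf
from sklearn.preprocessing import StandardScaler

tf.keras.utils.set_random_seed(1)  # random seed: control stochastic training

# Synthetic data; preprocessing at least standardizes the inputs.
X = np.random.rand(1000, 20)
y = (X.sum(axis=1) > 10).astype("float32")
X = StandardScaler().fit_transform(X)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    # Number of nodes controls capacity; rectifier activation; He initialization
    # pairs with the rectifier; L2 penalty on weights, L1 penalty on activations.
    tf.keras.layers.Dense(
        128, activation="relu",
        kernel_initializer="he_uniform",
        kernel_regularizer=tf.keras.regularizers.l2(1e-4),
        activity_regularizer=tf.keras.regularizers.l1(1e-5)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="sgd", loss="binary_crossentropy")
```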
Configuring the number of nodes in a layer is challenging and perhaps one of the questions most frequently asked by beginners. Yoshua Bengio suggests that using the same number of nodes in each hidden layer might be a good starting point. (ref)
In a large comparative study, we found that using the same size for all layers worked generally better or the same as using a decreasing size (pyramid-like) or increasing size (upside down pyramid), but of course this may be data-dependent. (ref)
For the first hidden layer, Bengio recommends using an overcomplete configuration (ref):
For most tasks that we worked on, we find that an overcomplete (larger than the input vector) first hidden layer works better than an undercomplete one.
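As a minimal sketch of this layer-sizing advice (the widths here are assumptions chosen for illustration), a network might reuse a single hidden-layer width that is larger than the input dimension:

```python
# Same width for every hidden layer, overcomplete relative to the input.
import tensorflow as tf

n_inputs = 20
width = 64  # larger than the input vector, repeated across hidden layers
model = tf.keras.Sequential([
    tf.keras.Input(shape=(n_inputs,)),
    tf.keras.layers.Dense(width, activation="relu"),
    tf.keras.layers.Dense(width, activation="relu"),
    tf.keras.layers.Dense(width, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```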
Tuning Hyperparameters
Note that the default configurations do well for most neural networks on most problems. Nevertheless, hyperparameter tuning is required to get the most out of a given model on a given dataset (ref).
Tuning hyperparameters can be challenging, both because of the computational resources required and because it is easy to overfit the validation dataset, resulting in misleading findings. You can think of hyperparameter selection as a difficult form of learning: there is an optimization problem (looking for hyper-parameter configurations that yield low validation error) and a generalization problem (there is uncertainty about the expected generalization after optimizing validation performance, and it is possible to overfit the validation error and get optimistically biased estimators of performance when comparing many hyper-parameter configurations). (ref)
Three systematic hyperparameter search strategies are suggested. These strategies can be used separately or even combined (ref):
- Coordinate Descent. Dial in each hyperparameter one at a time.
- Multi-Resolution Search. Iteratively zoom in the search interval.
- Grid Search. Define an n-dimensional grid of values and test each in turn.
The grid search is perhaps the most commonly understood and widely used method for tuning model hyperparameters. It is exhaustive, but the advantage of the grid search, compared to many other optimization strategies (such as coordinate descent), is that it is fully parallelizable. (ref)
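A minimal sketch of a parallel grid search using scikit-learn's GridSearchCV (an assumed tool; the grid values are illustrative only) might look like this:

```python
# A minimal sketch of a parallel grid search with scikit-learn.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=1)

grid = {
    "learning_rate_init": [0.001, 0.01, 0.1],
    "alpha": [1e-5, 1e-4, 1e-3],
    "hidden_layer_sizes": [(32,), (64,), (128,)],
}
# Each cell of the grid is evaluated independently, so the search
# parallelizes trivially across cores (n_jobs=-1) or machines.
search = GridSearchCV(MLPClassifier(max_iter=500, random_state=1),
                      param_grid=grid, cv=3, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```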
Prof. Bengio suggests keeping a human in the loop to keep an eye out for bugs and use pattern recognition to identify trends and change the shape of the search space. Humans can get very good at performing hyperparameter search, and having a human in the loop also has the advantage that it can help detect bugs or unwanted or unexpected behavior of a learning algorithm. (ref)
A serious problem with the grid search approach to find good hyper-parameter configurations is that it scales exponentially badly with the number of hyperparameters considered. In other words, grid search is exhaustive and slow. Bengio suggests using a random sampling strategy, which has been shown to be effective. The interval of each hyperparameter can be searched uniformly. This distribution can be biased by including priors, such as the choice of sensible defaults. (ref)
The idea of random sampling is to replace the regular grid by a random (typically uniform) sampling. Each tested hyper-parameter configuration is selected by independently sampling each hyper-parameter from a prior distribution (typically uniform in the log-domain, inside the interval of interest). (ref)
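A minimal sketch of this random sampling strategy using scipy and scikit-learn's RandomizedSearchCV (assumed tools; the intervals and number of draws are illustrative only):

```python
# A minimal sketch of random sampling in the log-domain.
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=1)

# Each configuration samples every hyper-parameter independently from a prior,
# here uniform in the log-domain over the interval of interest.
distributions = {
    "learning_rate_init": loguniform(1e-4, 1e-1),
    "alpha": loguniform(1e-6, 1e-2),
}
search = RandomizedSearchCV(MLPClassifier(max_iter=500, random_state=1),
                            param_distributions=distributions,
                            n_iter=20, cv=3, random_state=1, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```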